Rohin Shah

Google DeepMind

Realistic honeypot evaluations for scheming propensity

May 28, 2026
Via arXiv

Quantifying the Necessity of Chain of Thought through Opaque Serial Depth

Mar 10, 2026
Via arXiv

Building Production-Ready Probes For Gemini

Jan 16, 2026
Via arXiv

Consistency Training Helps Stop Sycophancy and Jailbreaks

Oct 31, 2025
Via arXiv

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Jul 15, 2025
Via arXiv

Evaluating Frontier Models for Stealth and Situational Awareness

May 02, 2025
Via arXiv

Evaluating the Goal-Directedness of Large Language Models

Apr 16, 2025
Via arXiv

An Approach to Technical AGI Safety and Security

Apr 02, 2025
Via arXiv

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Aug 09, 2024
Via arXiv

On scalable oversight with weak LLMs judging strong LLMs

Jul 05, 2024
Via arXiv